Abstract



The ability to efficiently find subgraphs and paths in a large graph that are
relevant to a given query is important in many applications, including
scientific data analysis, social networks, and business intelligence. Currently,
there is little support and no efficient approach for expressing and executing
such queries. This paper proposes a data model and a query language to address
this problem. The contributions include supporting the construction and
selection of: (i) folder nodes, representing a set of related entities, and
(ii) path nodes, representing a set of paths, in which a path is the transitive
relationship of two or more entities in the graph. Folders and paths can be
stored and used for future queries. We introduce FPSPARQL, which is an extension
of SPARQL supporting folder and path nodes. We have implemented a query engine
that supports FPSPARQL, and the evaluation results show its viability and
efficiency for querying large graph datasets.



1 Introduction 



Graphs are a generic structure for representing data in many domains, including
business intelligence, scientific data analysis, provenance, bibliographic
networks, and social networking [1]. With the enormous amount of data available,
the resultant graphs are very large. Examples are Web and social network graphs,
which may contain millions of nodes. The need for efficient approaches for
querying and analyzing these graphs is pressing. In particular, manipulating,
querying, and analyzing graphs to discover new knowledge is of
high interest [T].

Among various types of queries on data graphs, those which return a graph, 
a set of subgraphs or a set of paths in the large graph are gaining attention. 
One such query is finding the influence graph of a paper in a bibliographic graph 
through analyzing the citation of the paper in the graph. As another example, 
we may want to find a set of related activities in a business process graph, as 
an informal description of the process may be available in the form of a process 
graph. There is a need for graph representation models and efficient approaches 
for expressing and executing these types of queries. 

Among languages for querying graphs, SPARQL is a declarative query lan- 
guage, an official W3C standard and widely used for querying and extracting 
information from directed-labeled RDF graphs [TH]. It is based on a powerful 
graph matching mechanism that allows binding variables to components in the 
input graph and supports conjunctions and disjunctions of triple patterns. In 
addition, operators akin to relational joins, unions, selections, and projections 
can be combined to build more expressive queries. However, SPARQL does not 
support the construction and retrieval of subgraphs. Also paths are not first 
class objects in SPARQL [13 01]. 

Addressing this problem is challenging, as there is a need for a comprehen- 
sive, scalable, and efficient query language for graph analysis. The language 
should be native to graphs, general enough to meet the heterogeneous nature of 
real world data, and declarative [T]. In this paper, we present an approach for 
representing and querying graphs. The main contributions of the paper are as 
follows: 

• We propose a graph data model that supports structured and unstructured
entities, and introduces folder and path nodes as first class abstractions.
A folder node contains a collection of related entities, and a path node
represents the results of a query that consists of one or more paths in the
graph (a path is defined based on a transitive relationship between two
entities).

• We define the FPSPARQL query language, a Folder-Path enabled extension
of SPARQL, to manipulate and query entities, and folder and
path nodes.

• We describe the implementation of a query engine supporting FPSPARQL, 
and evaluate our query engine over large datasets. 

The remainder of this paper is organized as follows: We present the data
model in Section 2 and illustrate the manipulation part of the data model in
Section 3. In Section 4 we discuss the query engine implementation. Section 5
presents a motivating scenario as an experiment and evaluates the proposed
query engine. Section 6 presents related work. Finally, we conclude the paper
with a prospect on future work in Section 7.

Figure 2.1: Representation of graph, Folder, and Path.

2 Graph Data Model, Folders, and Paths 

We define a graph data model for organizing a set of entities as graph nodes and
entity relationships as edges of the graph. An entity is a data object that exists
separately and has a unique identity. This data model supports: (i) structured
and unstructured entities; (ii) folder nodes, which contain entity collections
(a folder node represents the results of a query that returns a collection of
related entities); and (iii) path nodes, which refer to one or more paths in the
graph that are the result of a query. A path is the result of the transitive
relationship between two entities. Entities and relationships are represented as
a directed graph G = (V, E), where V is a set of nodes representing entities,
folder nodes, or path nodes, and E is a set of directed edges representing
relationships between nodes.
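The definition above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not the storage layout used by the query engine: the class names `Node` and `Graph`, the `kind` field, and the `partOf` edge label are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A graph node: an entity, a folder node, or a path node."""
    node_id: str
    kind: str  # "entity", "folder", or "path" (hypothetical tag)
    attributes: dict = field(default_factory=dict)


@dataclass
class Graph:
    """Directed graph G = (V, E) with labelled edges."""
    nodes: dict = field(default_factory=dict)  # node_id -> Node (the set V)
    edges: set = field(default_factory=set)    # (from_id, label, to_id) (the set E)

    def add_node(self, node_id, kind="entity", **attributes):
        self.nodes[node_id] = Node(node_id, kind, attributes)
        return self.nodes[node_id]

    def add_edge(self, from_id, label, to_id):
        # Both endpoints must already be members of V.
        if from_id not in self.nodes or to_id not in self.nodes:
            raise KeyError("both endpoints must be nodes of the graph")
        self.edges.add((from_id, label, to_id))


g = Graph()
g.add_node("paper1", type="paper", venue="CAiSE")
g.add_node("author1", type="author")
g.add_edge("paper1", "authoredBy", "author1")

# A folder node is an ordinary member of V, so edges can point to it.
g.add_node("CAiSEPapers", kind="folder", description="papers at CAiSE")
g.add_edge("paper1", "partOf", "CAiSEPapers")
```

Treating folder and path nodes as ordinary members of V is what lets later queries address them like any other node.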

2.1 Entities 

Entities can be structured or unstructured. Structured entities are instances
of entity types. An entity type consists of a set of attributes. Unstructured
entities are also described by a set of attributes but may not conform to an
entity type. This entity model offers flexibility when types are unknown and
takes advantage of structure when types are known. For the sake of simplicity,
we assume all unstructured entities are instances of a generic type called ITEM.
ITEM is similar to the generic table in [18]. We store entities in the entity store.

Example 1. Consider the bibliographical graph in Figure 1(a). In this graph
we have entity types such as author, paper and venue. The graph in Figure
1(b) illustrates the creation (i.e. ancestry relationships) of 'paper1'. 'paper1'
and 'document1' are structured entities; 'file1' is an unstructured entity with
unknown entity type. The sample entity store in Figure 1(c) contains all the
entities in this graph. The graph store in Figure 1(d) contains the directed
links between entities.

2.2 Relationships 

A relationship is a directed link between a pair of entities, which is associated 
with a predicate defined on the attributes of entities that characterizes the rela- 
tionship. A relationship can be explicit, such as authoredBy in 'paper authoredBy 
author' in a bibliographical network. Also a relationship can be implicit, such 
as a relationship between an entity and a larger (composite) entity that can be 
inferred from the nodes. 

2.3 Folder nodes 

A folder node contains a set of entities that are related to each other. In other
words, the set of entities in a folder node is the result of a given query that
requires grouping graph entities in a certain way. A folder node creates a
higher level node on top of which other queries can be executed. Folders can be
nested, i.e., a folder can be a member of another folder node, to allow creating
and querying folders with relationships at higher levels of abstraction. A folder
may have a set of attributes that describes it. A folder node is added to the
graph and can be stored in the folder store to enable reuse of the query results
for frequent or recurrent queries.

Example 2. As an example of a relationship, let us consider a correlation
condition for two entities, defined as a binary predicate over attributes of the
entities. We call two entities correlated if the predicate evaluates to true.
Consider the correlation condition x.venue = 'CAiSE', where x is an instance of
type paper. This query groups the set of papers published in the 'CAiSE'
conference. As illustrated in Figure 1(a), the result of this query is the set
{'paper1', 'paper3'}. We add a folder node to the original graph and store the
result of this query in the folder store (Figure 1(e)). For this purpose, we
filter all the tuples in the graph store (Figure 1(d)) whose column 'node-from'
is 'paper1' or 'paper3'. Properties of this folder are stored in the entity
store (Figure 1(c)). In the folder store, the nodes under the column 'subject'
are the members of this folder.
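The steps of Example 2 can be sketched in Python as follows. This is a minimal illustration under assumptions: the three stores are modelled as in-memory dictionaries and lists, the sample rows are invented for the example, and `construct_folder` is a hypothetical helper rather than part of the query engine.

```python
# Entity store: node id -> attributes (hypothetical sample data).
entity_store = {
    "paper1": {"type": "paper", "venue": "CAiSE"},
    "paper2": {"type": "paper", "venue": "SIGMOD"},
    "paper3": {"type": "paper", "venue": "CAiSE"},
    "author1": {"type": "author"},
}

# Graph store: (node_from, edge_label, node_to) rows.
graph_store = [
    ("paper1", "authoredBy", "author1"),
    ("paper2", "authoredBy", "author1"),
    ("paper3", "citedBy", "paper1"),
]

# Folder store: folder name -> graph-store rows of its members.
folder_store = {}


def construct_folder(name, predicate, **attributes):
    """Group the entities satisfying `predicate` under a new folder node."""
    members = {nid for nid, attrs in entity_store.items() if predicate(attrs)}
    # Keep only the graph-store rows whose node_from is a folder member.
    rows = [row for row in graph_store if row[0] in members]
    # The folder's own properties go to the entity store.
    entity_store[name] = {"type": "folder", **attributes}
    folder_store[name] = rows
    return members


members = construct_folder(
    "CAiSEPapers",
    lambda a: a.get("type") == "paper" and a.get("venue") == "CAiSE",
    description="papers published in CAiSE",
)
```

Because the folder keeps its member rows, a later query can run against `folder_store["CAiSEPapers"]` instead of scanning the whole graph store.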

2.4 Path nodes 

A path is a transitive relationship between two entities showing the sequence of
edges from the start entity to the end. This relationship can be codified using
regular expressions [2] [5] in which the alphabet consists of the nodes and
edges of the graph. We define a path node for each query that results in a set
of paths. We use existing reachability approaches to verify whether an entity is
reachable from another entity in the graph. Some reachability approaches (e.g.
all-pairs shortest path [5]) report all possible paths between two entities. We
define a path node as a triple (V_start, V_end, RE) in which V_start is the
start node, V_end is the end node, and RE is a regular expression. We store all
paths of a path node in the path store.

For example, in a bibliographic graph, one possible query that results in a
set of paths in the graph is "find all conferences for papers citing a given
paper". Such a query helps to understand which conferences cite papers from a
given conference. The result of such a query is a set of paths from the given
paper to the publication venues of the papers citing it. In cases where the
second entity of a target path query is not given, the query requires a maximum
length to limit the search for matching end entities to within that maximum
length from the start entity.

Example 3. Consider the bibliographical network presented in Figure 1(a).
Assume we are interested in finding occurrences of the following pattern:
'paper2' cited by 'paper1', possibly indirectly (follow the red edges in the
figure). This path can be written as a regular expression, starting with
"paper (citedBy paper)+". The plus sign indicates that there are one or more
occurrences of the preceding element. The result of this example is stored in
the sample path store presented in Figure 1(f).
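One way to read the pattern in Example 3 is to enumerate candidate paths, encode each path as a string of node types and edge labels, and match that string against the regular expression. The Python sketch below follows this reading. It is an assumption-laden illustration, not the regular expression processor of the query engine: the sample edges, the `node_type` table, and the helper names are hypothetical, and Python's `re` module stands in for the path expression syntax.

```python
import re

# Hypothetical sample data following the citation chain of Figure 1(a).
node_type = {"paper1": "paper", "paper2": "paper", "paper3": "paper",
             "paper4": "paper", "author1": "author"}
edges = [
    ("paper2", "citedBy", "paper4"),
    ("paper4", "citedBy", "paper3"),
    ("paper3", "citedBy", "paper1"),
    ("paper1", "authoredBy", "author1"),
]


def simple_paths(start, end, max_len=10):
    """Yield each cycle-free path from start to end as [node, label, node, ...]."""
    stack = [[start]]
    while stack:
        path = stack.pop()
        last = path[-1]
        if last == end and len(path) > 1:
            yield path
            continue
        if len(path) // 2 >= max_len:  # number of edges already on the path
            continue
        for src, label, dst in edges:
            # path[::2] holds the nodes visited so far.
            if src == last and dst not in path[::2]:
                stack.append(path + [label, dst])


def encode(path):
    """Replace nodes (even positions) by their type; keep edge labels."""
    return " ".join(node_type[x] if i % 2 == 0 else x
                    for i, x in enumerate(path))


# "paper (citedBy paper)+" written as a Python regular expression.
PATTERN = re.compile(r"paper( citedBy paper)+")


def matching_paths(start, end):
    return [p for p in simple_paths(start, end)
            if PATTERN.fullmatch(encode(p))]


paths = matching_paths("paper2", "paper1")
```

The `max_len` bound plays the role of the maximum path length that is needed when the end entity is not fixed in advance.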

Example 4. Consider the historical graph presented in Figure 1(b). The ancestry
relationships found in provenance¹ form a directed graph, i.e. a historical graph.
When an object A is found to have been derived from some other object B, we
say that there is an ancestry path between A and B [12]. Figure 1(b) illustrates
the ancestry path between 'paper1' and 'file1' (follow the red edges in the figure).
Ancestry paths through historical graphs form the basis of many provenance
queries.

3 Data Manipulation 

Our graph-based data model and query requirements are very similar to those
in the SPARQL query language [15]. Thus, we decided to base our language
on SPARQL. We support two levels of queries in FPSPARQL: (i) Graph-level
Queries: at this level we use SPARQL to query graphs; and (ii) Node-level
Queries: at this level we propose an extension of SPARQL to construct and
query folder nodes and path nodes.

1 Provenance [17] is the process of recording events happening in digital
environments, which generates the documented history of information items'
creation.

Figure 3.1: Representation of the graph-level query proposed in example 4.

3.1 Graph-level Queries

SPARQL is an RDF query language, standardized by the World Wide Web
Consortium, for the semantic web. SPARQL contains capabilities for querying
required and optional graph patterns along with their conjunctions and
disjunctions. SPARQL also supports extensible value testing and constraining
queries. The results of SPARQL queries can be result sets or RDF graphs. A basic
SPARQL query has the form:

select ?variable1 ?variable2 ...
where {
  pattern1. pattern2. ...
}

Each pattern consists of subject, predicate and object, and each of these can 
be either a variable or a literal. The query specifies the known literals and 
leaves the unknowns as variables. To answer a query we need to find all pos- 
sible variable bindings that satisfy the given patterns. We use the '@' symbol 
for representing attribute edges and distinguishing them from the relationship 
edges between graph nodes. Example 5 presents a sample graph-level query. 
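The idea of answering a query by finding all variable bindings that satisfy the patterns can be sketched in Python as follows. This is a naive nested-loop illustration under assumptions, not how a SPARQL engine or the FPSPARQL engine evaluates queries: the function names and the sample triples (including the author names) are hypothetical.

```python
def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`; None on failure."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            # A literal must match exactly.
            return None
    return new


def evaluate(patterns, triples):
    """Return every binding that satisfies all patterns (a conjunction)."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for t in triples
                    if (b2 := match_pattern(pattern, t, b)) is not None]
    return bindings


def select(variables, patterns, triples):
    """Project the satisfying bindings onto the selected variables."""
    return [tuple(b[v] for v in variables) for b in evaluate(patterns, triples)]


# Hypothetical triples; '@' marks attribute edges as in the text.
triples = [
    ("paper1", "@type", "paper"),
    ("paper1", "authoredBy", "author1"),
    ("paper2", "@type", "paper"),
    ("paper2", "authoredBy", "author2"),
    ("author1", "@name", "Alice"),
    ("author2", "@name", "Bob"),
]

rows = select(["?p"], [("?p", "@type", "paper"),
                       ("?p", "authoredBy", "?a"),
                       ("?a", "@name", "Alice")], triples)
```

Each pattern narrows the list of candidate bindings, which mirrors the join of triple patterns described above.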



Example 5. Figure 3.1(b) depicts a sample SPARQL query over the sample
RDF graph of Figure 3.1(a) to retrieve the web page information of the author
of a book chapter with the title "Querying RDF Data".



3.2 Node-level Queries 

Standard SPARQL querying mechanisms are not enough to support the querying
needs of FPSPARQL and its data model. In particular, SPARQL does not
natively support folder nodes or queries over them, so such queries need to
be applied to the whole graph. In addition, querying the result of a previous
query becomes complex and cumbersome at best. Also, path nodes are not first
class objects in SPARQL [5] [12]. We extend SPARQL to support node-level
queries to satisfy specific querying needs of our data model. Node-level queries
in FPSPARQL include two special constructs: (a) CONSTRUCT queries: used
for constructing folder and path nodes, and (b) APPLY queries: used to ease
the requirement of applying queries on folder and path nodes.



Folder Node Construction. 

To construct a folder node, we introduce the FCONSTRUCT command. This
command is used to group a set of related entities or folders. A basic folder
node construction query looks like this:

fconstruct <Folder_Node Name>
[ select ?var1 ?var2 ... |
  (Folder_Node1 Name, Folder_Node2 Name, ...) ]
where { pattern1. pattern2. ... }

A query can be used to define a new folder node by listing the folder node name
and entity definitions in the fconstruct and select statements, respectively.
Example 6 represents such a query. A folder node can also be defined to group a
set of folder nodes. A simple example of such a query is presented in Example 7.
A set of user defined attributes for the folder can be defined in the where
statement.

Example 6. Construct a folder node (and name it CAiSEPapers) for the query
represented in example 2.

fconstruct CAiSEPapers as ?fn
select ?p
where {
  ?fn @description 'set of ...'.
  ?p @type paper.
  ?p publishedIn 'CAiSE'.
}

In this example the variable ?p represents the papers published in the 'CAiSE'
conference. The variable ?fn represents the folder node to be constructed, i.e.
'CAiSEPapers'. This folder node has a user defined attribute called
'description'.

Example 7. Consider 3 folder nodes 'SIGMOD08', 'SIGMOD09', and 'SIGMOD10',
each representing the accepted papers of the SIGMOD conference in 2008, 2009,
and 2010. Construct a folder node (i.e. 'SIGMOD') to group these folder nodes.

fconstruct SIGMOD as ?fn
(SIGMOD08, SIGMOD09, SIGMOD10)
where {
  ?fn @description 'set of related folder nodes'.
}

In this example the variable ?fn represents the folder node to be constructed,
i.e. 'SIGMOD'. This folder node contains 3 folder nodes and has a user defined
attribute 'description'. These folder nodes are hierarchically organized by
part-of relationships (i.e. an implicit relationship).

Path Node Construction. 

We introduce the PCONSTRUCT command to construct a path node. This
command is used to discover transitive relationships between two entities and
store them under a path node name. In general, a basic path node construction
query looks like this:

pconstruct <Path_Node Name>
(Start Node, End Node, Regular Expression)
where {
  pattern1. pattern2. ...
}

A regular expression can be used to define a transitive relationship between
two entities, i.e. the starting node and the ending node. Attributes of the
starting node, the ending node, and the regular expression alphabet (i.e. graph
nodes and edges) can be defined in the where statement. Example 8 represents
such a query.

Example 8. Consider the bibliographical network presented in Figure 1(a).
Construct a path node for the possible transitive relationship between 'paper2'
and 'paper1', to analyze the citations of 'paper2'.

pconstruct p2p1Path
(?startNode, ?endNode, (?e ?n)* ?citedByEdge (?n ?e)*)
where {
  ?startNode @id p2.
  ?endNode @id p1.
  ?n @isA entityNode.
  ?e @isA edge.
  ?citedByEdge @isA edge.
  ?citedByEdge @label citedBy.
}

In this example ?startNode denotes 'paper2' and ?endNode denotes 'paper1'.
?e and ?n denote, respectively, any edge and any node in the transitive
relationship between 'paper2' and 'paper1', and ?citedByEdge denotes an edge
labeled 'citedBy' in the path node. The isA attribute denotes the class
attribute of the entities (see Figure 1(c)). In the regular expression,
parentheses are used to define the scope and precedence of the operators, and
the asterisk indicates that there are zero or more occurrences of the preceding
element. This regular expression matches the path 'paper2 citedBy paper4
citedBy paper3 citedBy paper1' (follow the red edges in Figure 1(a)), which is
stored under the path node p2p1Path.

Folder Node Queries. 

We introduce the APPLY command to retrieve information, i.e. by applying
queries, from the underlying folder nodes. These queries can be applied to one
folder node or to the composition of several folder nodes. Our model supports
the standard set operations (union, intersect, and minus) to apply queries on
the composition of several folder nodes. In general, a basic folder node query
looks like this:

[Folder Node | (Composition of Folder Nodes)] APPLY (
  select ?variable1 ?variable2 ...
  where {
    pattern1. pattern2. ...
  })

A graph-level query can be applied on a folder node (see example 9) or on a
composition of folder nodes (see example 10) by listing the folder node or the
composition of folder nodes before the apply command, and placing the query in
parentheses after the apply command.

Example 9. Consider the folder node CAiSEPapers constructed in example 6.
We are interested in applying the query "retrieve the papers authored by
author1" on this folder node.

(CAiSEPapers) apply (
  select ?p
  where {
    ?p @type paper.
    ?p authoredBy ?a.
    ?a @type author.
    ?a @name 'author1'.
  })

In this example ?p denotes papers that fall inside the CAiSEPapers folder node,
and the query is applied on these papers. The result is the set of papers
published in the 'CAiSE' conference which are authored by 'author1'.

Example 10. Consider that we have constructed two folder nodes: CAiSEPapers
(the set of papers published in the 'CAiSE' conference) and SIGMODPapers (the
set of papers published in the 'SIGMOD' conference). We are interested in
retrieving the papers published in 'SIGMOD' or 'CAiSE' which are authored by
'author1'.

(CAiSEPapers union SIGMODPapers) apply (
  select ?p
  where {
    ?p @type paper.
    ?p authoredBy ?a.
    ?a @type author.
    ?a @name 'author1'.
  })

In this example ?p denotes papers that fall inside either the CAiSEPapers or
the SIGMODPapers folder node. The query "retrieve the papers that are authored
by author1" is applied on these papers.
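The composition of folder nodes in Examples 9 and 10 behaves like set algebra followed by a filter. The Python sketch below illustrates that reading under assumptions: folders are modelled as plain sets of member ids, the membership and authorship data are invented for the example, and `compose` and `apply_query` are hypothetical helpers, not engine functions.

```python
# Hypothetical folder contents and authorship data.
folders = {
    "CAiSEPapers": {"paper1", "paper3"},
    "SIGMODPapers": {"paper2", "paper3"},
}
authored_by = {"paper1": "author1", "paper2": "author2",
               "paper3": "author1", "paper4": "author1"}


def compose(op, left, right):
    """Combine two folder nodes with union, intersect, or minus."""
    ops = {
        "union": set.union,
        "intersect": set.intersection,
        "minus": set.difference,
    }
    return ops[op](folders[left], folders[right])


def apply_query(members, predicate):
    """Evaluate the query only over the members of the composed folder."""
    return sorted(m for m in members if predicate(m))


def by_author1(paper):
    return authored_by.get(paper) == "author1"


union_result = apply_query(
    compose("union", "CAiSEPapers", "SIGMODPapers"), by_author1)
```

Note that 'paper4' never appears in a result even though it is authored by 'author1', because the query is restricted to folder members rather than the whole graph.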

Path Node Queries. 

This type of query is used to retrieve information, i.e. by applying queries, from
the underlying path node. Path node queries are similar to folder node queries
and use the same command, i.e. the APPLY command. In general, a basic path
node query looks like this:

Path_Node_Name APPLY (
  select ?variable1 ?variable2 ...
  where {
    pattern1. pattern2. ...
  })

A graph-level query can be applied on a path node by listing the path node
name before the apply command, and placing the query in parentheses after the
apply command. Example 11 represents a simple example of such a query.

Example 11. Consider the path node p2p1Path constructed in example 8. We
are interested in finding papers in the transitive relationship between 'paper2'
and 'paper1' that have the keyword 'SQL' in their titles.

(p2p1Path) apply (
  select ?p
  where {
    ?p @type paper.
    ?p @title ?t.
    Filter regex(?t, "SQL").
  })

In this example ?p denotes papers that fall inside the p2p1Path path node. The
query "retrieve the papers that have the keyword 'SQL' in their titles" is
applied on these papers.



4 Implementation 

The simplest way to store a set of RDF statements is to use a relational database
with a single table that includes columns for subject, property, and object. While
simple, this schema quickly hits scalability limitations [20]. To avoid this, we
developed a relational RDF store that covers the three storage approaches
classified in [20]: vertical (triple), property (n-ary), and horizontal (binary).
The query engine is implemented in Java (J2EE) and consists of two main layers
(Figure 4.1): data mapping and query engine.

Data Mapping Layer. 

This layer is responsible for creating data element mappings between semantic 
web technology (i.e. Resource Description Framework) and relational database 
schema. We developed a workload-independent physical design by developing 
a Loader algorithm. This algorithm is responsible for: (i) validating the input 
RDF; (ii) generating the relational representation of triple RDF store, for manip- 
ulating and querying entities, folders, and paths; and (iii) generating powerful 
indexing mechanisms. 

Query Engine. 

The query engine consists of two layers: query mapping and query optimizer.
The query mapping layer consists of an FPSPARQL parser (for parsing FPSPARQL
queries based upon the syntax of FPSPARQL) and a schema-independent
FPSPARQL-to-SQL translation algorithm. This algorithm consists of:

• SPARQL-to-SQL translation algorithm. We implemented a SPARQL-to-SQL
translation algorithm based on the proposed relational algebra for
SPARQL [5] and semantics preserving SPARQL-to-SQL query
translation [TJ. This algorithm supports aggregate queries and keyword search
queries. Figure 4.2 shows a SPARQL query, its translation into a
relational operator tree, and its equivalent SQL query, which is generated by
this algorithm.

• Folder node construction and querying. We use the relational representa- 
tion of triple RDF store, to store, manipulate, and query folder nodes. 

• Path node construction and querying. To describe constraints on the path
nodes, we reused the specification for regular expressions and filter
expressions proposed in CSPARQL [2]. We developed a regular expression
processor which supports optional elements (?), loops (+, *), alternation
(|), and grouping ((...)) [5]. We provide the ability to call external graph
reachability algorithms (see Section 5.3) for path node queries.
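To make the SPARQL-to-SQL step in the list above more concrete, the following Python sketch translates a list of triple patterns into a single SQL query over a triple table, using one table alias per pattern and join conditions for shared variables. It is a simplified illustration under assumptions and not the translation algorithm of the paper: the function `to_sql`, the table name `triples`, and the sample rows are hypothetical, and SQLite (via Python's standard `sqlite3` module) stands in for the DB2 back end.

```python
import sqlite3


def to_sql(select_vars, patterns):
    """Translate triple patterns into one SQL query with a self-join per pattern."""
    columns = ("subject", "predicate", "object")
    first_use, where, params = {}, [], []
    for i, pattern in enumerate(patterns):
        for col, term in zip(columns, pattern):
            ref = f"t{i}.{col}"
            if term.startswith("?"):
                if term in first_use:
                    # Shared variable: join with its first occurrence.
                    where.append(f"{ref} = {first_use[term]}")
                else:
                    first_use[term] = ref
            else:
                # Literal: bind it as a query parameter.
                where.append(f"{ref} = ?")
                params.append(term)
    tables = ", ".join(f"triples t{i}" for i in range(len(patterns)))
    cols = ", ".join(first_use[v] for v in select_vars)
    sql = f"SELECT DISTINCT {cols} FROM {tables}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql, params


# Single-table (vertical) layout with hypothetical sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("paper1", "type", "paper"),
    ("paper1", "authoredBy", "author1"),
    ("paper2", "type", "paper"),
    ("paper2", "authoredBy", "author2"),
    ("author1", "name", "Alice"),
])

sql, params = to_sql(["?p"], [("?p", "type", "paper"),
                               ("?p", "authoredBy", "?a"),
                               ("?a", "name", "Alice")])
rows = conn.execute(sql, params).fetchall()
```

Every additional pattern adds another alias of the same table, which is the self-join cost that property tables are meant to reduce.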

To optimize the performance of queries, we developed four optimization 
techniques proposed in [SJ [SU1 13 : (i) selection of queries with specified varying 
degrees of structure and spanning keyword queries; (ii) selection of the smallest 
table to query based on the type information of an instance; (iii) elimination of 
redundancies in basic graph pattern based on the semantics of the patterns and 
database schema; and (iv) creation of separate tables (property tables) for
subjects that tend to have common properties, to reduce the self-join problem.

Figure 4.2: A SPARQL query, its translation into a relational operator tree,
and its equivalent SQL query generated by our translation algorithm.

5 Experimental Evaluation

In this section we provide an experimental evaluation of the FPSPARQL query
engine. We utilized IBM DB2 as the back-end database. All experiments were
conducted on an HP system with a 2.67 GHz Core2 Quad processor and 4 GB
of memory, running 64-bit Windows 7. We give an overview of the
evaluation example and datasets used in Section 5.1. We compared our system
with HyperGraphDB [13] (an open-source graph database) and present query
running time measurements in Section 5.2. We provide the ability to call external
graph reachability algorithms for path node queries, and discuss the quality of
finding paths by different approaches in Section 5.3.

5.1 Evaluation Example 

Our example falls within the business intelligence domain and comes from our
experience in managing an online project-based course, "e-Enterprise Projects",
during 2009 and 2010. There are different people (e.g. students, mentors and
lecturers) involved in this course. For example in semester 2-2009 we had 66 
people (60 students + 5 project mentors + 1 lecturer) involved in the course ac- 
tivities. During this semester, fifteen project groups (each group consists of four 
students) have been formed where each group has been allocated to one of the 
available projects. Each mentor has been allocated to supervise three projects. 
The development process of each project has gone through a sequence of pre- 
defined phases: brainstorming, requirements analysis, design phase, prototype 
implementation, testing and final product delivery. 

The activities of each project have been documented through a Web-based
project management system which is equipped with many back-end modules,
such as: 1) a Message Board to exchange messages and open discussion topics
between the project members; 2) a Wiki System which is used to collaboratively
edit documents related to the activities of the project; 3) a Blogging System
where each user has their own blog to edit their own posts; 4) a File Sharing
System where project members can share access to different files and documents;
and 5) an SVN Repository to synchronize the editing of the project's source
code. Figure 5.1 depicts an illustration of our motivating scenario.

1 www.cse.unsw.edu.au/~cs9323

Figure 5.1: Evaluation example.

The graph structure in our example is made up of entities such as artifact, 
process, and agent. An artifact is an immutable piece of state which has a digital 
representation in a computer system, e.g. a file. A process is an action or series 
of actions performed on or caused by artifacts. An agent is an entity which is 
capable of acting as a catalyst of a process. These entities are connected by 
one or more specific types of interdependency, such as 'process used artifact', 
'process wasTriggeredBy process', 'process wasControledBy agent', 'artifact was- 
GeneratedBy process', and 'artifact wasDerivedFrom artifact' (for more detail 
see [IT]). 



5.2 Query Execution Time 

We evaluated the performance of the FPSPARQL query engine compared to
one of the well-known graph databases, HyperGraphDB [13]. There is no
query language for HyperGraphDB, and querying is performed through special
purpose APIs. These APIs are based on conditional expressions that a user
creates and submits to the query system, receiving a set of nodes as the result.
We extracted and simulated over one million events (about 25 million triples)
out of the e-Enterprise course database (Section 5.1) to generate a large RDF
file (1.9 GB). It took 22.8 minutes to load the input RDF file into the FPSPARQL
relational RDF store. HyperGraphDB manages storage as a set of files in a
directory. To create a database and load the same file as input into
HyperGraphDB, we implemented a loader. The loader took 52.2 minutes to
load the input file. In the appendix we present FPSPARQL query samples that
were useful for our e-Enterprise course collaborators. For each query expressed
in English, we construct an FPSPARQL query and its equivalent SQL queries,
generated by the FPSPARQL-to-SQL translation algorithm.

Figure 5.2 illustrates the query execution time for each FPSPARQL query,
its SPARQL equivalent, and the HyperGraphDB API. Query1 is a folder node
construction query. Query1 runs a bit faster on SPARQL than on FPSPARQL.
The reason could be a small overhead for storing the folder in the FPSPARQL
query. Both SPARQL and FPSPARQL executed faster than HyperGraphDB. Query2
is a folder node selection query. The execution time of FPSPARQL shows
that applying queries on folder nodes improves the query processing time of
many complex queries. The equivalent SPARQL query has to apply the
condition on the whole graph, which takes longer to execute. The execution time
of HyperGraphDB is much better than that of the SPARQL query, but does not
match that of the FPSPARQL query. Query3 is a folder node selection query. In
FPSPARQL, the query is applied on the composition of two constructed folder
nodes. For HyperGraphDB we generated the same folders as hypergraphs and
applied a query on their composition. Figure 5.2 shows the better performance
of FPSPARQL compared to its equivalent SPARQL query and the HyperGraphDB API.

Query4 is a path node construction query. FPSPARQL provides the ability
to call external graph reachability algorithms in path node queries (see
Section 5.3). It took the FPSPARQL engine 15.7 minutes to parse the regular
expressions and explore potential paths. As a result, one path was discovered.
HyperGraphDB has APIs providing traversal algorithms (breadth-first
or depth-first). The performance of these APIs depends on the incidence
index and the efficient caching of incidence sets. We applied efficient indexing
and caching to run Query4 on HyperGraphDB. The query took 63.8 minutes to
explore potential paths. As a result, one path was discovered. We stored the
path (manually) as a hypergraph to use in Query5. Query4 is not supported in
the SPARQL query language.

Query5 is a path node selection query. In FPSPARQL, the query is applied
on the path node constructed in Query4. In HyperGraphDB, the query is
applied on the path generated in Query4, which was stored manually as a
hypergraph. HyperGraphDB does not support the automatic construction and
selection of paths. It also does not provide the ability to call external
traversal algorithms. Query5 is not supported in the SPARQL query language.
Figure 5.2 illustrates the performance of these queries.

5.3 Graph Reachability Analysis 

We developed an interface to support various graph reachability algorithms [I]
such as Transitive Closure, GRIPP, Tree Cover, Chain Cover, Path-Tree Cover,
and Shortest-Paths [TT|. In general, there are two types of graph reachability
algorithms [1]: (1) algorithms that traverse from the starting vertex to the
ending vertex using breadth-first or depth-first search over the graph, and (2)
algorithms that check whether the connection between two nodes exists in the
edge transitive closure of the graph. Considering G = (V, E) as a directed graph
that has n nodes and m edges, the first approach incurs a high cost of O(n + m)
time per query. The second approach results in high storage consumption of
O(n^2), which requires too much space. In this experiment, we used
the GRIPP [22] algorithm, which has a query time complexity of O(m - n), an
index construction time complexity of O(n + m), and an index size complexity of
O(n + m).

Figure 5.2: Query Execution Times.
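The trade-off between the two families of reachability algorithms can be illustrated with a short Python sketch: a breadth-first search that pays O(n + m) time per query, and a precomputed transitive closure that answers queries by lookup but may need O(n^2) space. This is a textbook illustration under assumptions, not GRIPP and not the interface of the query engine; the function names and the sample adjacency list are hypothetical.

```python
from collections import deque


def reachable_bfs(adj, start, end):
    """Approach (1): traverse the graph for every query, O(n + m) time."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == end:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def transitive_closure(adj):
    """Approach (2): precompute all reachable pairs, up to O(n^2) space."""
    nodes = set(adj) | {v for targets in adj.values() for v in targets}
    closure = {u: set(adj.get(u, ())) for u in nodes}
    for k in nodes:  # Warshall-style: allow k as an intermediate node
        for u in nodes:
            if k in closure[u]:
                closure[u] |= closure[k]
    return closure


# Hypothetical citation chain: paper2 -> paper4 -> paper3 -> paper1.
adj = {
    "paper2": ["paper4"],
    "paper4": ["paper3"],
    "paper3": ["paper1"],
    "paper1": [],
}
closure = transitive_closure(adj)
```

With the closure in hand, a reachability query is a set membership test such as `"paper1" in closure["paper2"]`, while `reachable_bfs` needs no index at all. Index-based schemes such as GRIPP aim for a point between these two extremes.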



6 Related Work 

A recent book pQ and survey [3] discuss a number of data models and query
languages for graph data. Some existing approaches for querying and modeling
graphs [9, 10] focus on defining constraints on nodes and edges simultaneously
on the entire object of interest, not in an iterative one-node-at-a-time manner.
Therefore, they do not support querying nodes at higher levels of abstraction.
The authors of [10] propose an Information Fragment as an abstraction for
representing a subgraph. They do not support querying information fragments.

BiQL [5] is an SQL-based query language focused on the uniform treatment 
of nodes and edges, and it supports queries that return subgraphs. BiQL supports 
a closure property on the result of its queries, meaning that the output of every 
query can be used for further querying. Compared to BiQL, in our work folders 
and paths are first class abstractions (graph nodes) that can be defined in a 
hierarchical manner and over which queries are supported. 

HyperGraphDB [13] is a graph database based on hypergraphs (a hypergraph 
node is connected through an edge to all vertices that are contained in 
it). There is no query language for HyperGraphDB, and querying is performed 
through special-purpose APIs. HyperGraphDB builds on two prior graph 
representation models, Hypernode [TB] and GROOVY [TS], which focus 
on representing objects and object schemas. GROOVY [T5] and Hypernode only 
support typed objects and have no support for hypernode-specific operations. 

SPARQL [19] is a declarative query language, and a W3C standard, for querying 
and extracting information from directed labeled RDF graphs. SPARQL supports 
queries consisting of triple patterns, conjunctions, disjunctions, and other 
optional patterns. However, there is no support for querying grouped entities. 
Paths are not first class objects in SPARQL (HUE]. PSPARQL [4] extends 
SPARQL with regular expression patterns, allowing path queries. SPARQLeR 
P3] is an extension of SPARQL designed for finding semantic associations (and 
path patterns) in RDF bases. In FPSPARQL, we support folder and path nodes 
as first class entities that can be defined at several levels of abstraction and 
queried. In addition, we provide an efficient implementation of a query engine 
that supports querying them. 

7 Conclusion 

In this paper, we presented a data model and query language for querying and 
analyzing large graphs, specifically for analyzing groups of related entities. The 
data model supports structured and unstructured entities, and introduces folder 
and path nodes as first class abstractions. The query language, FPSPARQL, is 
defined as an extension of SPARQL to manipulate and query entities, and folder 
and path nodes. We have developed an efficient and scalable implementation of 
FPSPARQL within a high-performance relational RDF storage and retrieval 
system. To evaluate the viability and efficiency of FPSPARQL, we have 
conducted experiments over large graph datasets. We compared the quality and 
speed of FPSPARQL with HyperGraphDB [13]. 

As future work, we plan to design a visual query interface to support users 
in expressing their queries over the conceptual representation of the graph in an 
easy way. Moreover, we plan to make use of interactive graph exploration and 
visualization techniques (e.g. storytelling systems [2T]) which can help users 
to quickly identify the interesting parts of a graph. We are also interested in 
the temporal aspects of graph analysis, as in some cases (e.g. provenance) the 
structure of the graph may change rapidly over time. 

Appendix: FPSPARQL Queries 

In this section we present FPSPARQL query samples that were useful for our 
e-Enterprise course collaborators. For each query expressed in English, we 
construct an FPSPARQL query and its equivalent SQL query, generated by the 
FPSPARQL-to-SQL translation algorithm. 
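
To give an intuition for the shape of the generated SQL, the following sketch maps a group of attribute patterns on one subject variable to self-joins over a triple table. It is a simplified illustration only, not the FPSPARQL-to-SQL translation algorithm itself: the function name patterns_to_sql, the three-column EntityStore schema, and the sample data are assumptions, and SQLite stands in for the relational back end.

```python
import sqlite3


def patterns_to_sql(var, patterns, select_attr, table="EntityStore"):
    """Build a self-join query (hypothetical helper, illustration only).

    patterns: list of (predicate, literal) constraints on one subject.
    select_attr: the attribute whose value is returned as column `var`.
    Literals are inlined for readability; real code should bind parameters.
    """
    aliases = [f"es{i + 1}" for i in range(len(patterns) + 1)]
    conds = [
        f"{alias}.predicate = '{pred}' AND {alias}.object = '{literal}'"
        for alias, (pred, literal) in zip(aliases, patterns)
    ]
    out = aliases[-1]  # the last alias carries the selected attribute
    conds.append(f"{out}.predicate = '{select_attr}'")
    # All aliases must describe the same subject (the shared ?e variable).
    conds += [f"{aliases[0]}.subject = {a}.subject" for a in aliases[1:]]
    from_clause = ", ".join(f"{table} {a}" for a in aliases)
    return f"SELECT {out}.object AS {var} FROM {from_clause} WHERE " + " AND ".join(conds)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EntityStore (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany("INSERT INTO EntityStore VALUES (?, ?, ?)", [
    ("e1", "@type", "Event"), ("e1", "@activityType", "generate"),
    ("e1", "@ArtifactName", "brainDoc.doc"),
    ("e2", "@type", "Event"), ("e2", "@activityType", "update"),
    ("e2", "@ArtifactName", "notes.txt"),
])

sql = patterns_to_sql("a", [("@type", "Event"), ("@activityType", "generate")],
                      "@ArtifactName")
rows = conn.execute(sql).fetchall()
print(sql)
print(rows)
```

Each triple pattern contributes one alias of the triple table, and the shared subject variable becomes a set of equality joins, which is the same structure visible in the SQL listings that follow.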

Query 1. [Folder Node Construction] Group all the events that happened in the 
context of the brainstorming phase during "semester 2, 2009" in a folder named 
"brainstorming09s2". The brainstorming phase starts on '19 July 2009' and ends 
on '8 August 2009'. 

FPSPARQL: 

fconstruct brainstorming09s2 as ?fn 
select ?e 
where { 
  ?fn @description 'related events...' . 
  ?e @type Event . 
  ?e @timestamp ?date . 
  FILTER (?date > "2009-07-19"^^xsd:date && 
          ?date < "2009-08-08"^^xsd:date) . 
} 

SQL: 

@folderID <- generate a unique folderID 

insert into NULLID.EntityStore 
  (subject, predicate, object) 
values 
  (@folderID, '@Name', 'brainstorming09s2'); 

insert into NULLID.FOLDERSTORE 
  (folderid, subject, predicate, object) 
select @folderID as folderid, subject, predicate, object 
from NULLID.GraphStore 
where subject in 
  (SELECT r.e AS e 
   FROM ( 
     Select es1.subject AS e, es2.object AS date 
     From NULLID.EntityStore es1, NULLID.EntityStore es2 
     Where es1.predicate = '@type' AND es1.object = 'Event' AND 
           es2.predicate = '@timestamp' AND es1.subject = es2.subject AND 
           (DATE(SUBSTRING(es2.object, 1, 10, CODEUNITS32)) > DATE('2009-07-19') 
            AND 
            DATE(SUBSTRING(es2.object, 1, 10, CODEUNITS32)) < DATE('2009-08-08')) 
   ) AS r ); 
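
The two insert statements above can be tried out on a small scale with the sketch below. It is an assumption-laden illustration rather than the engine's implementation: SQLite replaces the relational store used in the paper, the three tables are reduced to plain text columns, the helper construct_folder and the sample events are hypothetical, and the date test relies on ISO-formatted timestamps comparing correctly as strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EntityStore (subject TEXT, predicate TEXT, object TEXT);
    CREATE TABLE GraphStore  (subject TEXT, predicate TEXT, object TEXT);
    CREATE TABLE FolderStore (folderid TEXT, subject TEXT, predicate TEXT, object TEXT);
""")
# Hypothetical sample data: three events, two inside the brainstorming window.
conn.executemany("INSERT INTO EntityStore VALUES (?, ?, ?)", [
    ("e1", "@type", "Event"), ("e1", "@timestamp", "2009-07-25T10:00:00"),
    ("e2", "@type", "Event"), ("e2", "@timestamp", "2009-09-01T09:30:00"),
    ("e3", "@type", "Event"), ("e3", "@timestamp", "2009-08-01T14:00:00"),
])
conn.executemany("INSERT INTO GraphStore VALUES (?, ?, ?)", [
    ("e3", "edge1", "e1"),  # edge starting at an in-window event
    ("e2", "edge2", "e3"),  # edge starting at an out-of-window event
])


def construct_folder(conn, folder_id, name, start, end):
    """Register a folder node and copy the matching edges into FolderStore."""
    conn.execute("INSERT INTO EntityStore VALUES (?, '@Name', ?)", (folder_id, name))
    conn.execute("""
        INSERT INTO FolderStore (folderid, subject, predicate, object)
        SELECT ?, g.subject, g.predicate, g.object
        FROM GraphStore g
        WHERE g.subject IN (
            SELECT t.subject
            FROM EntityStore t JOIN EntityStore d ON t.subject = d.subject
            WHERE t.predicate = '@type' AND t.object = 'Event'
              AND d.predicate = '@timestamp'
              AND substr(d.object, 1, 10) > ? AND substr(d.object, 1, 10) < ?)
    """, (folder_id, start, end))


construct_folder(conn, "f1", "brainstorming09s2", "2009-07-19", "2009-08-08")
members = conn.execute(
    "SELECT subject, object FROM FolderStore WHERE folderid = 'f1'").fetchall()
print(members)
```

Because the folder is materialized as rows keyed by its identifier, later queries can restrict themselves to the folder with a simple membership test.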



Query 2. [Folder Node Selection] Return the list of artifacts that have been 
part of update events which were triggered by a comment event in the context 
of the brainstorming phase during "semester 2, 2009", i.e. the folder we 
created in Query 1. 

FPSPARQL: 

(brainstorming09s2) apply ( 
  select ?a 
  where { 
    ?e @type 'Event' . 
    ?e @activityType 'update' . 
    ?e @ArtifactName ?a . 
    ?e wasTriggeredBy ?x . 
    ?x @type 'Event' . 
    ?x @activityType 'comment' . 
  }) 

SQL: 

SELECT r.a AS a FROM ( 
  Select es1.subject AS e, es3.object AS a, fs1.object AS x 
  From NULLID.EntityStore es1, NULLID.EntityStore es2, 
       NULLID.EntityStore es3, NULLID.FOLDERSTORE fs1, 
       NULLID.EntityStore es4, NULLID.EntityStore es5 
  Where es1.predicate = '@type' AND es1.object = 'Event' 
    AND es4.object = 'Event' AND es2.predicate = '@activityType' 
    AND es2.object = 'update' AND es3.predicate = '@ArtifactName' 
    AND fs1.predicate in ( 
      select subject 
      from NULLID.EntityStore 
      where predicate = '@Label' AND object = 'wasTriggeredBy') 
    AND fs1.FolderID in ( 
      Select subject 
      from NULLID.EntityStore 
      where predicate = '@Name' AND object = 'brainstorming09s2') 
    AND es5.object = 'comment' AND es1.subject = es2.subject 
    AND es1.subject = es3.subject AND es1.subject = fs1.subject 
    AND es1.object = es4.object AND es4.subject = es5.subject 
    AND fs1.object = es4.subject AND es1.predicate = es4.predicate 
    AND es2.predicate = es5.predicate ) AS r 

Query 3. [Folder Node Selection] Return the list of users who were involved 
in updating an artifact during the brainstorming and design phases of semester 
1, 2010. We construct two folders of all events that happened in the context of 
the brainstorming phase (brainstorming10s1) and the design phase (design10s1) 
during "semester 1, 2010". 

FPSPARQL: 

(brainstorming10s1 union design10s1) apply ( 
  select ?u 
  where { 
    ?e @type 'Event' . 
    ?e @activityType 'update' . 
    ?e @UseName ?u . 
  }) 

SQL: 

SELECT r.u AS u FROM ( 
  Select es1.subject AS e, es3.object AS u 
  From NULLID.EntityStore es1, NULLID.EntityStore es2, 
       NULLID.EntityStore es3 
  Where es1.predicate = '@type' AND es1.object = 'Event' AND 
        es2.predicate = '@activityType' AND es2.object = 'update' AND 
        es3.predicate = '@UseName' AND es1.subject = es2.subject AND 
        es1.subject = es3.subject AND es1.subject in ( 
          select subject 
          from NULLID.FOLDERSTORE 
          where folderid in ( 
            select subject from NULLID.EntityStore 
            where predicate = '@Name' and object = 'brainstorming10s1') 
          union 
          select subject 
          from NULLID.FOLDERSTORE 
          where folderid in ( 
            select subject 
            from NULLID.EntityStore 
            where predicate = '@Name' and object = 'design10s1') ) ) AS r 
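
The effect of applying a query to a union of folders can be sketched without a database. The code below is a minimal illustration under stated assumptions: folders are modeled as plain sets of event identifiers, the helper apply_to_folders is hypothetical, and the sample triples are invented. The union of the folders defines the scope, and the attribute patterns are evaluated only against subjects inside that scope.

```python
# Hypothetical attribute triples (subject, predicate, object).
triples = [
    ("e1", "@type", "Event"), ("e1", "@activityType", "update"), ("e1", "@UseName", "alice"),
    ("e2", "@type", "Event"), ("e2", "@activityType", "update"), ("e2", "@UseName", "bob"),
    ("e3", "@type", "Event"), ("e3", "@activityType", "comment"), ("e3", "@UseName", "carol"),
    ("e4", "@type", "Event"), ("e4", "@activityType", "update"), ("e4", "@UseName", "dave"),
]
# Hypothetical folder membership; e4 belongs to neither folder.
folders = {"brainstorming10s1": {"e1", "e3"}, "design10s1": {"e2"}}


def apply_to_folders(folder_names, triples, folders, required, wanted):
    """Evaluate attribute constraints only on subjects inside the folder union."""
    scope = set().union(*(folders[name] for name in folder_names))
    attrs = {}
    for subject, predicate, obj in triples:
        if subject in scope:
            attrs.setdefault(subject, {})[predicate] = obj
    return sorted(
        a[wanted] for a in attrs.values()
        if wanted in a and all(a.get(k) == v for k, v in required.items())
    )


users = apply_to_folders(
    ["brainstorming10s1", "design10s1"], triples, folders,
    required={"@type": "Event", "@activityType": "update"}, wanted="@UseName",
)
print(users)
```

Here "dave" performed an update but is excluded because his event lies outside both folders, and "carol" is excluded because her event is a comment.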



Query 4. [Path Node Construction] Construct a path between the event that 
generates the brainstorming document (brainDoc.doc) and the event that 
generates the design document (designDoc.doc), both of which were rendered by 
project4 members during semester 2, 2009. This path should contain the pattern 
of an event responding to a bug report in the Wiki. 

FPSPARQL: 

pconstruct myPathNode 
  (?startNode, ?endNode, (?e ?n)* ?e ?node ?e (?n ?e)*) 
where { 
  ?startNode @type 'Event' . 
  ?startNode @activityType 'generate' . 
  ?startNode @artifactName 'brainDoc.doc' . 
  ?startNode @UserGroup 'project4' . 
  ?startNode @timestamp ?date . 
  ?endNode @type 'Event' . 
  ?endNode @activityType 'generate' . 
  ?endNode @artifactName 'designDoc.doc' . 
  ?endNode @UserGroup 'project4' . 
  ?endNode @timestamp ?date . 
  ?n @isA 'entityNode' . 
  ?n @type 'Event' . 
  ?n @timestamp ?date . 
  ?e @isA 'edge' . 
  ?node @type 'Event' . 
  ?node @activityType 'response' . 
  ?node @layer 'Wiki' . 
  ?node @layerPart 'bug' . 
  ?node @timestamp ?date . 
  FILTER (?date > "2009-07-19"^^xsd:date && 
          ?date < "2009-11-04"^^xsd:date) . } 

SQL: 

A graph reachability algorithm is used (see Section 5). 
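
As a rough illustration of what path node construction has to compute, the sketch below enumerates simple paths between a start and an end event and keeps only those that pass through an intermediate node matching the required pattern (an event responding to a Wiki bug report). It uses a plain depth-first search, not GRIPP or any other indexed algorithm from Section 5, and the function paths_through, the node names, and the toy graph are all assumptions made for the example.

```python
def paths_through(adj, start, end, is_required):
    """Return simple paths start -> end that contain a matching intermediate node."""
    results = []

    def dfs(node, path):
        if node == end:
            # Only intermediate nodes can play the role of ?node in the pattern.
            if any(is_required(n) for n in path[1:-1]):
                results.append(list(path))
            return
        for nxt in adj.get(node, ()):
            if nxt not in path:  # keep paths simple (no repeated nodes)
                path.append(nxt)
                dfs(nxt, path)
                path.pop()

    dfs(start, [start])
    return results


# Hypothetical event graph: two routes from the brainstorming document event
# to the design document event, one of which passes through a bug response.
adj = {
    "brain": ["c1", "u1"],
    "c1": ["bugfix"],
    "u1": ["design"],
    "bugfix": ["design"],
    "design": [],
}
attrs = {"bugfix": {"@activityType": "response", "@layer": "Wiki", "@layerPart": "bug"}}


def responds_to_wiki_bug(node):
    a = attrs.get(node, {})
    return (a.get("@activityType") == "response"
            and a.get("@layer") == "Wiki"
            and a.get("@layerPart") == "bug")


found = paths_through(adj, "brain", "design", responds_to_wiki_bug)
print(found)
```

Exhaustive enumeration like this is exponential in the worst case, which is one reason to delegate the reachability check to an indexed algorithm before materializing the path node.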



Query 5. [Path Node Selection] Return the list of artifacts that were 
generated along the path constructed in Query 4. 



FPSPARQL: 

(myPathNode) apply ( 
  select ?a 
  where { 
    ?e @type 'Event' . 
    ?e @activityType 'generate' . 
    ?e @ArtifactName ?a . 
  }) 



SQL: 



SELECT r.a AS a FROM ( 
  Select es1.subject AS e, es3.object AS a 
  From NULLID.EntityStore es1, NULLID.EntityStore es2, 
       NULLID.EntityStore es3 
  Where es1.predicate = '@type' AND es1.object = 'Event' AND 
        es2.predicate = '@activityType' AND es2.object = 'generate' 
        AND es3.predicate = '@ArtifactName' AND es1.subject = es2.subject 
        AND es1.subject = es3.subject AND es1.subject in ( 
          SELECT subject 
          from NULLID.PathStore 
          Where PathId in ( 
            SELECT subject 
            from NULLID.EntityStore 
            Where predicate = '@Name' AND object = 'myPathNode') ) ) AS r 